ABSTRACT

This paper presents an algorithm to automatically design
two-level fat-tree networks, such as those widely used in
large-scale data centres and cluster supercomputers. Each of
the two levels may use a different type of switch from the
design database to achieve an optimal network structure.
Links between layers can run in bundles to simplify cabling.
Several sample network designs are examined, and their
technical and economic characteristics are discussed.

The characteristic feature of our approach is that real-life
equipment prices and values of technical characteristics are
used. This makes it possible to select an optimal combination
of hardware to build the network (including semi-populated
configurations of modular switches) and to accurately estimate
the cost of this network. We also show how technical
characteristics of the network can be derived from its per-port
metrics and suggest heuristics for equipment placement.

The algorithm is useful as part of a larger design procedure
that selects optimal hardware for a cluster supercomputer as
a whole. The article therefore focuses on the use of fat-trees
for high-performance computing, although the results are
valid for any type of data centre.

Categories and Subject Descriptors 

C.2.1 [Computer-Communication Networks]: Network 
Architecture and Design — Network topology; K.6.2 [Mana- 
gement of Computing and Information Systems]: In- 
stallation Management — Computer selection 

General Terms 

Design, Economics 

Keywords 

Fat-tree network 

1. INTRODUCTION 

Parallel computers use many types of networks to interconnect
their computing elements. Frequently used topologies include
stars, meshes, tori and trees.

Beowulf-style cluster supercomputers often employ fat-tree
topologies built using readily available off-the-shelf
InfiniBand hardware. We describe an algorithm that automatically
designs fat-tree networks for a variety of objective functions,
the most obvious example being the total cost of the network.
The algorithm is implemented in a software tool [15].

This algorithm is intended to be used as a part of a CAD sys- 
tem for cluster supercomputers [16]. Such a system would 
iterate through different combinations of hardware, vary- 
ing the number of compute nodes and other parameters. 
Thus, designing an interconnection network for every hard- 
ware combination under review is a self-contained and highly 
repetitive operation that must be performed efficiently. 

Many researchers of fat-tree networks concentrate on general
properties of such networks and on the large fabrics that could
be built using them. We focus on real-life scenarios, tailoring
network designs to the number of network endpoints and the
available switches. For example, our approach makes it possible
to select optimal configurations of modular switches, with just
the right number of leaf modules installed.

Current ASIC technology has enabled readily available,
off-the-shelf InfiniBand switches with P = 36 ports. Such
switches can be used to build two-level fat-trees with as many
as $P^2/2 = 648$ cluster nodes. For many typical installations
this is enough.

However, vendors also provide high-radix modular switches,
which internally implement a two-level fat-tree. Switches
with up to P = 648 ports (in non-blocking configurations)
are available, hence networks with more than 200K nodes
can be built with the proposed algorithm. This far exceeds
the demands of even today's most powerful supercomputers.
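The two capacity figures above both follow from the $P^2/2$ relation. A minimal Python check is sketched below; the function name `max_nodes_two_level` is chosen here for illustration only and is not taken from the tool [15].

```python
def max_nodes_two_level(ports: int) -> int:
    """Upper bound on endpoints of a non-blocking two-level fat-tree
    built from identical switches with `ports` ports each (P^2 / 2)."""
    # Each edge switch gives half of its ports to compute nodes, and a
    # core switch with P ports can reach at most P edge switches.
    return ports * (ports // 2)


print(max_nodes_two_level(36))   # 36-port switches: 648
print(max_nodes_two_level(648))  # 648-port modular switches: 209952
```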

On the other hand, intermediate-sized designs that do not 
use the full capacity provided by switches tend to have un- 
used ports unless designed carefully. If no network expan- 
sion is anticipated, unused ports represent a waste of hard- 
ware resources. Therefore our algorithm tries to minimize 
the number of unused ports. Additionally, the algorithm 
reports if links between switches can run in bundles. Such 
bundles can be implemented with cables that aggregate mul- 
tiple links, e.g., 12x instead of 4x InfiniBand cables. This 
results in a lower number of cables and reduced cable bulk. 

To obtain cable costs, equipment placement and routing can 
be performed (not discussed in this article), or alternatively 
an "average" cable price can be used, see section 6.3 for de- 
tails. 



During the design process, other characteristics of intercon- 
nection networks, such as reliability, can be estimated and 
used as design constraints or as a part of a complex objective 
function. 

The rest of the article is organized as follows. Section 2 
describes existing work in the field of fat-tree networks and 
their economic issues. Section 3 introduces the main algo- 
rithm, and section 4 discusses it. In section 5 we conduct a 
sample run of the algorithm and present the results. Section 
6 explains how to obtain technical and economic character- 
istics of fat-trees using per-port metrics, discusses design 
for future expansion and proposes heuristics for equipment 
placement. Finally, section 7 concludes the article. 

2. RELATED WORK 

Fat-trees were initially introduced by C. Leiserson [9]. The
mathematical formalism to describe their structure, "k-ary
n-trees", was proposed by Petrini and Vanneschi [13]. Zahavi
[17] further introduced two other formalisms for describing
fat-trees: Parallel Ports Generalized Fat-Trees, where links
between switches can run in parallel, and Real Life Fat-Trees,
where bandwidth between layers stays constant to guarantee
contention-free operation.

A tool called NetWires [3] was created by H. G. Dietz as part
of the larger Cluster Design Rules project [2]. NetWires is
able to design different types of interconnection networks, 
including trees, tori and a specific Flat Neighbourhood Net- 
work, using user-supplied parameters, and outputs a wiring 
diagram. Aside from the number of required switches, no 
other technical or economic characteristics are assessed. Our 
approach is different in that we only require a few input pa- 
rameters from the user, and iterate through other parame- 
ters automatically, trying to find a combination that yields 
an optimal value to a certain objective function subject to 
constraints. 

Gupta and Dally [5] suggested a tool to optimize network 
topology, in the broad class of hybrid Clos-torus networks. 
Cost, packaging and performance constraints can be speci- 
fied. This tool is most valuable for building custom inter- 
connection solutions when arbitrary topologies are feasible, 
contrary to the case of using commodity switches where most 
parameters are fixed but optimization can take actual prices 
into account. 

Al-Fares et al. [1] proposed using fat-trees for generic data
centre networks built from commodity hardware. Farrington et
al. [4] followed up, suggesting a 3,456-port data centre
switch built with commodity chips ("merchant silicon")
internally connected in a fat-tree topology. They also advise
using optical fibre cables with as many as 72 or even 120
separate fibres (strands) to minimize the volume and weight
of cable bundles for inter-switch links.

Mudigonda et al. [10] introduced Perseus, a framework to 
design fat-tree and HyperX topologies for data centres, and 
elaborated on cable tracing issues. However, fat-tree topolo- 
gies built by Perseus use identical switches on all levels. 

Parallel applications typically exhibit locality of
communications. Therefore, in multi-level non-blocking fat-tree
networks the bandwidth offered by upper levels may remain
underutilized. On intermediate levels, switch ports can be
redistributed so that the number of links to the lower level
is larger than the number of links to the upper level. This
reduces the "fatness" of the tree, providing substantial
hardware savings in terms of switches and links.

Navaridas et al. [11] introduced such a reduced topology, 
thin-tree, and analysed its behaviour using simulation and 
several synthetic workloads. Overall, for the mix of work- 
loads, different configurations of the reduced topology were 
found beneficial in terms of "performance/cost" ratio com- 
pared to traditional fat-trees, especially when collective op- 
erations were only lightly used. They add, however, that 
in the absence of a topology-aware scheduler, neighbouring 
processes may be assigned to physically distant processing 
nodes, requiring full bandwidth at upper levels and thus 
rendering reduced topologies useless. 

Kamil et al. [6] similarly proposed a reduced topology, but 
used communication patterns of actual parallel applications 
for analysis. 

Kim et al. [7] introduced a flattened butterfly topology, pro- 
viding detailed analysis of cost breakdown for electrical ca- 
bles. 12x InfiniBand cables, aggregating three 4x links, were 
shown to be more economical than separate 4x cables and 
to additionally reduce cable bulk. Their subsequent work 
[8] compared cost models for electrical and active optical 
cables, showing that in 2008 prices, optical cables are less 
expensive starting from 10m. Parker and Scott [12] further 
advocate for the adoption of optical interconnects. 

Singla et al. [14] proposed to abstain from rigid network 
structures such as fat-trees, and connect switches in a ran- 
dom order, in a topology called Jellyfish. They found that 
with the same performance figures and the same network 
equipment as the fat-tree, their topology supports more servers 
(performance results were obtained via simulation with random
permutation traffic). Another benefit of Jellyfish is its
ability to expand incrementally.

3. ALGORITHM 

Let us consider the algorithm to design fat-tree networks
with two levels of switches (namely, edge and core layers).
Suppose we have two databases, for edge and core switches,
respectively, with each switch characterized primarily by the
number of its ports. We denote the number of ports of a
specific edge switch by $P_E$, and that of a core switch by
$P_C$.

Some switch models can be used for building both core and 
edge levels, and can be present in both databases. The two 
layers of network can employ different types of switches, but 
switches within the same layer are identical. 

Let $\mathcal{E}$ and $\mathcal{C}$ be the sets of edge and core switches. The i-th
switch in each set is characterized by its model and its number of ports,
e.g.: $\mathcal{C} = \{(M_{C_i}, P_{C_i})\}$. These sets are the algorithm's
input. Their structure allows them to contain several models
of switches with the same number of ports but with differing
characteristics, such as cost, reliability, energy consumption,
etc.



For blade servers, which are installed into enclosures, edge- 
level switches are also installed in the same enclosures, and 
thus E usually contains only one element - a single switch, 
compatible with the enclosure. Ordinary rack-mounted servers 
can, on the contrary, use a variety of edge-level switches. 

Apart from $\mathcal{E}$ and $\mathcal{C}$, other inputs for the algorithm are
N, the number of compute nodes that need to be interconnected,
and Bl, the blocking (oversubscription) factor,
which denotes the decrease in bandwidth available to compute
nodes compared to a full, non-blocking fat-tree.

The outputs of the algorithm are the models of edge and core
switches used to obtain the optimal design, as well as E and
C, the numbers of edge and core switches, respectively, and
f, the value of the objective function.



Algorithm 1 Design a two-level fat-tree network
Input:
    N: Number of nodes to interconnect
    Bl: Blocking factor
    $\mathcal{E}$, $\mathcal{C}$: Sets of edge and core switches
Goal: Optimal network structure:
    E, C: Number of edge and core switches
    $Bl_r$: Resulting blocking factor
    B: Number of links in a bundle
    L: Number of cables
    f: Objective function for the optimal network structure
1: { First trivial case: }
2: if (Using blade servers) and (Only two enclosures) then
3:     Trivial case 1: connect enclosures with cables
4:     Compute $f_1$
5: end if
6: { Second trivial case: }
7: if $\exists (M, P) \in \mathcal{E} \cup \mathcal{C} : P \geq N$ then
8:     { If there exists a switch with N or more ports }
9:     Trivial case 2: use star network
10:    Compute $f_2$
11: end if
12: { Main loop: iterate through edge switches }
13: for all edge switches $(M_{E_i}, P_{E_i}) \in \mathcal{E}$ do
14:    $P_{En_i} \leftarrow \lfloor P_{E_i} \cdot (Bl/(1 + Bl)) \rfloor$ { Ports to nodes }
15:    $P_{Ec_i} \leftarrow P_{E_i} - P_{En_i}$ { Ports to core level }
16:    $Bl_r \leftarrow P_{En_i}/P_{Ec_i}$ { Resulting blocking }
17:    $E_i \leftarrow \lceil N/P_{En_i} \rceil$ { Number of edge switches }
18:    for all core switches $(M_{C_j}, P_{C_j}) \in \mathcal{C}$ do
19:       if $P_{C_j} \geq E_i$ then { Core switch suitable }
20:          { Try core switch $M_{C_j}$ }
21:          $B \leftarrow \min(P_{C_j} \div E_i, P_{Ec_i})$ { Links in a bundle }
22:          $C \leftarrow \lceil P_{Ec_i}/B \rceil$ { Number of core switches }
23:          $L \leftarrow N + E_i \cdot P_{Ec_i}$ { Number of cables }
24:          Compute $f_{i,j} = f(E_i, C_j)$
25:       end if
26:    end for
27: end for
28: Choose optimal combination of $M_C$ and $M_E$: $f_3 = \min f_{i,j}$
29: Output optimal network structure: $f = \min(f_1, f_2, f_3)$




In a fat-tree network with a blocking factor Bl and edge
switches with $P_E$ ports, $P_{En} = \lfloor P_E \cdot (Bl/(1 + Bl)) \rfloor$ of
those ports are used to connect compute nodes, and the
rest are used to connect the edge switch to the core layer.
Under these conditions, in order to connect all N nodes,
$E = \lceil N/P_{En} \rceil$ edge switches are required.

The remaining ports on edge switches are connected to core
layer switches. When building the core layer, each port on
an edge switch is connected to a different core switch. For
example, the first ports of all edge switches connect to the
same core switch. This means that a core switch must have at
least as many ports as there are edge switches: its port count
must satisfy $P_C \geq P_{Cmin} = E$.

Similarly, if a core switch has $P_C \geq P_{Cmin}$ ports, it can
be connected to a maximum of $P_C$ edge switches, each of
those having $P_{En}$ compute nodes connected to it. Therefore,
the maximum number of nodes that could be connected is
$N_{max} = P_C \cdot P_{En}$.
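The port-splitting relations above can be sketched in a few lines of Python. This is only an illustration: the function names are ours, and rounding $P_{En}$ down is an assumption that matches the three examples given later in this section. Exact rational arithmetic (`fractions.Fraction`) is used so that a fractional blocking factor does not introduce floating-point rounding.

```python
from fractions import Fraction
from math import floor


def split_edge_ports(p_e: int, bl) -> tuple[int, int]:
    """Return (P_En, P_Ec): edge ports towards nodes and towards the core."""
    bl = Fraction(bl)
    p_en = floor(p_e * bl / (1 + bl))  # assumed to round down
    return p_en, p_e - p_en


def edge_switches_needed(n: int, p_en: int) -> int:
    """E = ceil(N / P_En), computed with integer arithmetic."""
    return -(-n // p_en)


def max_nodes(p_c: int, p_en: int) -> int:
    """N_max = P_C * P_En for a given core switch and port split."""
    return p_c * p_en


p_en, p_ec = split_edge_ports(36, 1)    # non-blocking, 36-port edge switch
print(p_en, p_ec)                       # 18 18
print(edge_switches_needed(60, p_en))   # 4
print(max_nodes(36, p_en))              # 648
```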

The algorithm structure is as follows. First we check for two 
trivial cases where a full two-level fat-tree network is not 
required. Then we iterate through a set of edge switches. 
For every edge switch in a set, we evaluate multiple possible 
network designs, trying suitable core switches, and choose 
one of them. Finally, the best design over all iterations is 
selected. 

Let us describe the algorithm by stages. 

1. The first stage is to check for the trivial case of two
blade enclosures (line 2). In this set-up, there are two
enclosures of $P_E/2$ servers, each fitted with a built-in
edge switch with $P_E$ ports. Half of the ports of each
switch are connected to servers. The two switches can
be directly connected together with $P_E/2$ cables, and
a core level is not necessary.

A cheaper alternative is to replace one of the switches
with a "pass through" panel, which directly connects
that enclosure's servers to the remaining switch.

(In the case of rack-mounted servers, these complications
are irrelevant, because two blocks of $P_E/2$ servers can
always be connected with a single edge switch with $P_E$
ports.)

We check whether this configuration satisfies the design
constraints (in particular, expandability constraints could be
violated) and, if it does, compute the value $f_1$ of the
objective function.

2. The second stage concerns the trivial case of a star
network (line 7). If there exists a switch, in either $\mathcal{E}$
or $\mathcal{C}$, with enough ports to accommodate all N nodes,
it can be used to build a star network. If several such
switches exist, we choose one. As in the previous
case, the value of the objective function, $f_2$, is then
computed.

3. The main loop iterates over available edge switches 
using index i. 

(a) For every switch model, we calculate: $P_{En_i}$, the
number of ports that are connected to compute
nodes (line 14), $P_{Ec_i}$, the number of ports
connected to the core level (line 15), $Bl_r$, the resulting
blocking factor (line 16), and finally $E_i$, the number
of required edge switches (line 17).

(b) We then iterate through all core switches using 
index j (line 18). If the number of ports on the 
core switch makes it suitable, we perform the fol- 
lowing actions. 

i. Calculate B, the number of links that run in 
parallel between edge and core switches (line 
21). 

The core level is built in the following way. 
We take one core switch. For every edge 
switch, we connect its first port to the core 
switch. As we have E edge switches, this 
operation will occupy E ports on the core 
switch. 

Now, we repeat this step several times until 
we run out of ports on the core switch. If this 
step is performed for a total of B times, then 
each of the edge switches becomes connected 
to the core switch with a bundle of B links. 
B can be obtained with a simple equation:

$B = P_{C_j} \div E$, where $\div$ denotes integer division.

In certain rare cases with high blocking factors
(see Example 3 below), only one core
switch is necessary to connect together all
edge switches, and then links from all ports
on the edge switch directed towards the core
level form a single bundle: $B = P_{Ec_i}$. Line 21
handles this scenario using the min function.

ii. After determining B, we calculate the num- 
ber of core switches C. 

iii. At this point, the number of edge and core
switches becomes known, hence we can calculate
the value of the objective function $f_{i,j}$
for this particular fat-tree configuration.

(c) We choose the optimal fat-tree configuration: $f_3 = \min f_{i,j}$
(or, alternatively, present a human designer
with several choices).

4. From all combinations obtained with the previous steps 
(trivial cases 1, 2) and main loop (3), we choose the 
one with the optimal value of the objective function 
(line 29). 
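The main loop (stage 3) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the implementation of the tool [15]: the two trivial cases are omitted, $P_{En}$ is assumed to be rounded down, and the names `Switch`, `Design`, `switch_cost` and `design_fat_tree`, as well as the `cost` attribute and the sample objective (total purchase price of switches), are hypothetical.

```python
from dataclasses import dataclass
from fractions import Fraction
from math import floor
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Switch:
    model: str         # M
    ports: int         # P
    cost: float = 0.0  # illustrative attribute used by the sample objective


@dataclass(frozen=True)
class Design:
    edge: Switch
    core: Switch
    num_edge: int       # E
    num_core: int       # C
    bundle: int         # B
    cables: int         # L
    blocking: Fraction  # Bl_r
    objective: float    # f


def switch_cost(edge: Switch, e: int, core: Switch, c: int) -> float:
    """Sample objective: total purchase cost of all switches."""
    return e * edge.cost + c * core.cost


def design_fat_tree(n: int, bl, edge_db: Sequence[Switch],
                    core_db: Sequence[Switch],
                    objective: Callable = switch_cost) -> Optional[Design]:
    """Main loop of Algorithm 1 (lines 13-28); trivial cases are not handled."""
    best = None
    ratio = Fraction(bl) / (1 + Fraction(bl))
    for edge in edge_db:
        p_en = floor(edge.ports * ratio)      # line 14, assumed to round down
        p_ec = edge.ports - p_en              # line 15
        if p_en == 0 or p_ec == 0:
            continue                          # switch cannot serve both sides
        e = -(-n // p_en)                     # line 17: ceil(N / P_En)
        for core in core_db:
            if core.ports < e:                # line 19: core switch unsuitable
                continue
            b = min(core.ports // e, p_ec)    # line 21
            c = -(-p_ec // b)                 # line 22: ceil(P_Ec / B)
            cables = n + e * p_ec             # line 23
            f = objective(edge, e, core, c)   # line 24
            candidate = Design(edge, core, e, c, b, cables,
                               Fraction(p_en, p_ec), f)
            if best is None or f < best.objective:
                best = candidate              # line 28: keep the minimum
    return best


sw36 = Switch("36-port", 36, 11_000)
d = design_fat_tree(60, 1, [sw36], [sw36])
print(d.num_edge, d.num_core, d.bundle, d.cables)  # 4 2 9 132
```

A different objective, such as total cost of ownership or power consumption, can be supplied through the `objective` parameter without changing the loop itself.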

Example 1. Suppose we need to interconnect N = 60 nodes
using 36-port switches ($P_E = P_C = 36$) with a non-blocking
network (Bl = 1). The algorithm would return E = 4 edge
switches, C = 2 core switches, and B = 9 links in a bundle.

The wiring diagram for the resultant network is shown in
Figure 1. Note that on the rightmost edge switch only 6
node ports are utilized, and 12 are left unused. Lines
between edge and core switches are thicker to represent
multiple links connecting switches in edge and core layers. In
this case, bundles of B = 9 links are used. Running links
in bundles allows for greater maintainability. Additionally,
using 12x InfiniBand cables that aggregate three 4x links
decreases the number of physical cables in a bundle
to only three.

Example 2. Let us design a network for N = 1200 nodes,
using a blocking factor of Bl = 2. We will use edge switches
with $P_E = 36$ ports and core switches with $P_C = 108$ ports.
Out of 36 ports on the edge switch, $P_{En} = 24$ will be
connected to compute nodes, and the remaining $P_{Ec} = 12$ ports
will be connected to the core layer. This provides the
blocking factor of Bl = 24/12 = 2. The algorithm would return
E = 50 edge switches, C = 6 core switches, and B = 2 links
in a bundle.

Example 3. Let us now design a network for N = 280 nodes
with an artificially high blocking factor Bl = 11, using
36-port switches (such a configuration is unsuitable for HPC
workloads, and is more relevant for generic data centre
environments). The algorithm would distribute ports on edge
switches in the following way: $P_{En} = 33$ ports will be
connected to compute nodes, and $P_{Ec} = 36 - 33 = 3$ ports
will be connected to the core level. The resulting blocking
factor is $Bl_r = 33/3 = 11$. The number of edge switches is
E = 9. Only three ports on edge switches are available for
connecting to the core level, therefore they will form a single
bundle: B = 3. The number of core switches is then C = 1.

Note that, when the number of edge switches E is deter- 
mined, there are two possible scenarios of connecting com- 
pute nodes to edge switches: (1) connect as many nodes 
to each switch as possible, and leave the last switch under- 
utilized (see Fig. 1), or (2) distribute compute nodes uni- 
formly between all switches. The latter scenario can, in very 
rare cases, lead to a lower number of required core switches. 
However, it also leads to difficulties when expanding the net- 
work, because there may not be enough free space in racks. 
Our software tool [15] that implements the algorithm checks
for that condition, and uses this scenario only if
it provides hardware savings and if the user did not specify
a preference for expandability.

4. DISCUSSION 

4.1 Switch configurations 

We consider two types of switches. The first is an ordinary
commodity non-modular switch with some redundant components.
An example would be a typical off-the-shelf 36-port
InfiniBand FDR switch, with redundant fans and an optionally
redundant power supply, but with a non-redundant
management board.

The second type is the modular switch. Consider, for example,
a 144-port InfiniBand switch, equipped with 9 line cards,
each connecting 16 nodes in a non-blocking configuration.
Fabric boards provide the internal fabric of
the switch, and all four fabric boards must be installed to
make the switch non-blocking. Line cards and fabric boards
are installed into the chassis, which also contains redundant
power supplies, redundant fans, and redundant management
boards.

(Such a switch itself contains a two-level fat-tree, with links
between core and edge levels running in bundles of B = 4
and implemented as traces on its backplane printed circuit
board. Parameters of this fat-tree network are as follows:
$P_E = 32$, $P_C = 36$, E = 9, C = 4.)

Modular switches have large port counts, which reduces the
overall number of switches in the network, improving
manageability and simplifying cable routing. Additionally, they
allow for future expandability by adding more line cards
when required. These benefits often outweigh their higher
prices per port.

Figure 1: Network of N = 60 nodes, created by the proposed algorithm. Thick lines between edge and core
levels represent bundles of nine network links.

When the number of nodes to be interconnected is lower
than the number of ports provided by the fully configured
modular switch, a reduced configuration can be used, with
fewer line cards installed. This significantly decreases
the cost of the switch compared to the full configuration.

Different reduced configurations are treated as different mod- 
els of switches in the databases E and C, because they have 
differing technical and economic characteristics. 

4.2 Design constraints and objective functions 

Objective functions can be diverse, and various design con- 
straints can be specified. Let us confine ourselves to a single 
example. Suppose we need to choose one of the following 
switches for the core level: (a) 144-port fully configured 
modular switch, or (b) partially configured 324-port mod- 
ular switch, with 144 configured ports. The former takes 
up less space in a rack, while the latter provides for future 
expansion. 

If constraints on equipment size are imposed, the 144-port
switch will be used, and the 324-port switch might not
even be tried if it violates constraints. Conversely, if constraints
on future expandability are imposed, the 144-port switch
might get discarded. If no constraints are imposed, an
exhaustive search will be performed: the value of the objective
function (e.g., the total cost of ownership of the network)
will be calculated for both variants to make the decision.

4.3 Cable count 

Cables that interconnect edge and core levels are laid out at 
installation time, and later updates are difficult and costly. 

Therefore, when future use of spare ports is anticipated, we
recommend establishing a full fabric between edge and core
levels. As we have E edge switches, each with $P_{Ec}$ ports
connected to the core level, we need $E \cdot P_{Ec}$ cables to establish
a full fabric between the layers. Additionally, N cables are
needed to connect compute nodes to the edge level.

The complete expression for the number of cables is therefore
$L = N + E \cdot P_{Ec}$. For example, in Figure 1 the number of
cables is $L = 60 + 4 \cdot 18 = 132$.

For blade servers, the connection of compute nodes to edge
switches does not require cables, therefore the first summand
in the above formula is eliminated.
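The cable count, including the blade-server special case, can be expressed as a small helper. The function name and the `blade_servers` flag are illustrative choices, not part of the tool [15].

```python
def cable_count(n: int, e: int, p_ec: int, blade_servers: bool = False) -> int:
    """L = N + E * P_Ec; the N node cables vanish for blade servers,
    whose edge switches sit inside the enclosures."""
    node_cables = 0 if blade_servers else n
    return node_cables + e * p_ec


print(cable_count(60, 4, 18))  # network of Figure 1: 132
```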

5. EXPERIMENTAL RESULTS 

We apply the proposed algorithm to a real-life scenario, 
perform detailed calculations and discuss economic impli- 
cations. Prices are subject to change over time, but this 
does not affect generality of conclusions. 

We build a fat-tree network for a cluster of N = 224 blade 
servers. The cluster is built with 14 enclosures, each of them 
containing 16 blade servers. Every enclosure is fitted with 
an edge switch with Pe = 32 ports. 

5.1 Design database 

We will use two possible core switches: (a) a 36-port
monolithic switch with a price of $11,000, which is roughly $306
per port, and (b) a modular switch which can be configured
with up to 108 ports, in multiples of 18. In the network
design tool this modular switch is represented as six switches
of 18, 36, 54, 72, 90 and 108 ports. Therefore, the set C of
core switches contains seven items in total.

The modular switch consists of a chassis ($25,000), 3 fabric
boards necessary to make the switch non-blocking ($9,000
each), and the required number of line cards (up to 6),
providing 18 ports each ($13,000 each). The full configuration
costs $130,000, or $1,204 per port. This is roughly four times
the price per port of a simple 36-port switch.
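The price of any configuration of the modular switch follows directly from the component prices above. The sketch below assumes that reduced configurations still carry the chassis and all three fabric boards, which agrees with the $117,000 quoted for the 90-port configuration in section 5.2; the constant and function names are ours.

```python
CHASSIS = 25_000
FABRIC_BOARD = 9_000
FABRIC_BOARDS = 3       # all three are needed for non-blocking operation
LINE_CARD = 13_000
PORTS_PER_CARD = 18
MAX_LINE_CARDS = 6


def modular_switch_cost(line_cards: int) -> int:
    """Price of the modular switch with the given number of line cards."""
    if not 1 <= line_cards <= MAX_LINE_CARDS:
        raise ValueError("line_cards must be between 1 and 6")
    return CHASSIS + FABRIC_BOARDS * FABRIC_BOARD + line_cards * LINE_CARD


def cost_per_port(line_cards: int) -> float:
    """Price per port of a given configuration."""
    return modular_switch_cost(line_cards) / (line_cards * PORTS_PER_CARD)


for cards in range(1, MAX_LINE_CARDS + 1):
    print(cards * PORTS_PER_CARD, modular_switch_cost(cards),
          round(cost_per_port(cards)))
```

The per-port price falls steadily as line cards are added, because the fixed cost of the chassis and fabric boards is spread over more ports.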

Modular switches have a lower port density, therefore using 
them can unexpectedly increase the total space taken by 
network equipment. If only limited space is available, this 
can be dealt with by imposing constraints on equipment size 
when running the algorithm. 

5.2 Possible core level configurations 

According to the algorithm, on the edge level E = 14 switches 
will be used. On the core level, there are seven possible 
choices of core switches. The least expensive configuration 
of the core level ($88,000) is obtained when using eight 36- 
port monolithic switches. 

Reduced configurations of the modular switch with 18, 36,
54 and 72 ports result in unreasonably high costs, and we
do not analyse them here. They could also be discarded using
the following heuristic: modular switches are cost-effective
when configured close to their full capacity.

The configurations with 90 and 108 ports are, however, of
special interest. Both of them require C = 3 core switches. The
90-port configuration will be used, as it is slightly cheaper
($117,000) than the full 108-port configuration ($130,000).
The cost of the core level with this configuration is therefore
3 × $117,000 = $351,000. This is roughly 4 times more
expensive than with 36-port switches.
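The comparison of all seven core-level choices can be reproduced with a short script. It is a sketch that applies lines 21 and 22 of Algorithm 1 to the prices of section 5.1, for E = 14 edge switches with 16 uplink ports each; the catalogue keys and function names are illustrative.

```python
def core_level(core_ports: int, num_edge: int, uplinks: int) -> tuple[int, int]:
    """Return (B, C): links per bundle and number of core switches."""
    b = min(core_ports // num_edge, uplinks)
    c = -(-uplinks // b)  # ceil(uplinks / b)
    return b, c


# name -> (ports, price); modular prices are chassis + 3 fabric boards + cards
catalogue = {"monolithic-36": (36, 11_000)}
for cards in range(1, 7):
    catalogue[f"modular-{18 * cards}"] = (18 * cards, 52_000 + 13_000 * cards)


def core_level_costs(num_edge: int = 14, uplinks: int = 16) -> dict:
    """Map each suitable core switch to (number of switches, total cost)."""
    result = {}
    for name, (ports, price) in catalogue.items():
        if ports < num_edge:
            continue  # too few ports to reach every edge switch
        _, c = core_level(ports, num_edge, uplinks)
        result[name] = (c, c * price)
    return result


for name, (count, cost) in core_level_costs().items():
    print(f"{name}: {count} switches, ${cost:,}")
```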

Figure 2: Network cost, estimated and actual.



5.3 Factoring in other costs 

We continue to compare two configurations of the core level 
- with 36-port switches and with 90-port switches. Let us 
now factor in the cost of 14 edge-level switches, located in 
enclosures (one switch per enclosure, $11,000 each), and the 
cost of 224 cables (as per section 4.3) of $80 each (an av- 
eraged price for cables of this length, calculated manually). 
The per-port total costs of the two networks are $1,160 and 
$2,334, respectively - a twofold difference. 

If we further add the cost of blade servers, equipped with
dual CPUs, memory and InfiniBand adapters ($9,600 per
server), and the cost of 14 enclosures ($7,500 each), we
obtain the total costs of the computer cluster: $2,515,320
and $2,778,320 for networks made of monolithic and modular
core switches, respectively. The difference per connected
server diminishes to 10.4%. This means that for a small
premium we gain the possibility of future network expansion
and greatly simplified cabling.
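The arithmetic behind these figures can be retraced as follows. The sketch uses only the prices quoted in sections 5.1 to 5.3; the constant and function names are ours.

```python
NODES = 224
EDGE_SWITCHES, EDGE_PRICE = 14, 11_000
CABLES, CABLE_PRICE = 224, 80
SERVER_PRICE = 9_600
ENCLOSURES, ENCLOSURE_PRICE = 14, 7_500


def network_cost(core_cost: int) -> int:
    """Core level plus edge switches plus cables."""
    return core_cost + EDGE_SWITCHES * EDGE_PRICE + CABLES * CABLE_PRICE


def cluster_cost(core_cost: int) -> int:
    """Network plus blade servers plus enclosures."""
    return (network_cost(core_cost)
            + NODES * SERVER_PRICE
            + ENCLOSURES * ENCLOSURE_PRICE)


monolithic, modular = 88_000, 351_000  # core level costs from section 5.2
print(network_cost(monolithic) / NODES, network_cost(modular) / NODES)
print(cluster_cost(monolithic), cluster_cost(modular))
print(cluster_cost(modular) / cluster_cost(monolithic) - 1)
```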

Additionally, these calculations demonstrate that using block- 
ing networks, such as thin-trees, will only marginally reduce 
total cost of the supercomputer, while potentially having 
severe consequences on performance. 

It is worth noting that the cost of cables is very low compared to
the cost of the entire computer cluster. This justifies the use of
rough approximations of cable costs when designing entire
computers (see also section 6.3).

6. OTHER DESIGN CONSIDERATIONS 
6.1 Per-port metrics 

Let us consider a particular case of a network where edge 
and core switches are identical, and all ports are occupied. 
Some useful metrics can be derived for such networks. 

Let us denote the port count on edge and core switches by P.
The network can connect a maximum of $N = N_{max} = P^2/2$
nodes. There will be P edge and P/2 core switches, for a total
of 3P/2 switches. As switches have P ports each, the total
number of ports on all switches will equal $3P^2/2 = 3N$,
which is thrice the number of nodes. In other words, for each
of the N connected nodes, the two-layer fat-tree network
employs three ports (and a three-layer network employs five).

Several important characteristics, such as network cost and
power consumption, are "additive" in the sense that x identical
switches cost x times more than a single switch and consume
x times more power. The same applies on a per-port level:
a network of identical arbitrarily connected switches, with
a total port count of y, costs y times the per-port cost and
consumes y times the per-port power.

This makes it easy to obtain a rough estimate of the cost,
power, rack space, weight and possibly other characteristics
of a network that supports N nodes by simply multiplying the
corresponding per-port characteristics of switches by 3N,
without the need for detailed analysis.

For example, the 36-port switch mentioned in section 5.1 has a
cost of $306 per port. Typical power consumption reported
by the manufacturer is 152 W with copper cables, which is 4.22 W
per port. The switch occupies 1U of rack space, hence its
"per-port rack space" is 1/36 U.

For a full configuration of N = 648 nodes, power consumption
is 3N times per-port consumption (8,204 W). Cost is
3N times per-port cost ($594,864). Occupied rack space is
3N times "per-port rack space" and equals 54U.
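The 3N rule is simple enough to capture in a one-line helper. The sketch below reuses the rounded per-port values quoted above; the function name is ours, and `Fraction` is used for rack space to keep 1/36 exact.

```python
from fractions import Fraction


def estimate(n: int, per_port_value):
    """Estimate a network-wide additive metric as 3N times its per-port value."""
    return 3 * n * per_port_value


n = 648
print(estimate(n, 306))              # cost in dollars
print(estimate(n, 4.22))             # power in watts
print(estimate(n, Fraction(1, 36)))  # rack space in U
```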

This estimate is also exact when the number of nodes
N is X times smaller than $N_{max}$, where X is a non-trivial
factor of P/2. In this case links between core and edge layers
run in bundles of X. For example, if P = 36, valid values
for X are 2, 3, 6 and 9. The estimate will thus be accurate
for clusters of 324, 216, 108 and 72 nodes.
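The cluster sizes for which the estimate is exact can be enumerated directly from the factors of P/2. A brief sketch, with a function name of our choosing:

```python
def exact_cluster_sizes(p: int) -> dict[int, int]:
    """Map each non-trivial factor X of P/2 to the cluster size N_max / X."""
    half = p // 2
    n_max = p * half
    return {x: n_max // x for x in range(2, half) if half % x == 0}


print(exact_cluster_sizes(36))  # {2: 324, 3: 216, 6: 108, 9: 72}
```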

In all other cases there will be spare ports on core and possibly
edge layers, and the above approach will systematically
underestimate metric values, because the actual network will
have more than 3N ports. Therefore, the estimate provides
a lower bound on metric values.

Figure 2 provides an example. The blue line represents the
actual cost of a network built with 36-port switches, including
cables, for $N \in [2, 160]$ nodes. The region of $2 \leq N \leq 36$
nodes represents the trivial case of a star network, with only
one switch: no fat-tree is required, hence the cost of the network
is kept low. Starting from 37 nodes, a two-layer fat-tree is
used. The stepped behaviour of the blue curve is explained by
the increased switch count for every additional P/2 = 18 nodes.
The monotonic increase inside a step is caused by the increased
cable count for every connected node.

The green line starts at the 37th node and represents the above
estimate: 3N multiplied by per-port cost, plus the cost of
cables. At 72 and 108 nodes it exactly matches the actual
cost, as discussed above, but at other points a discrepancy
is observed, with a median value of 12%.

Figure 3: Equipment placement for partly expandable configuration.

Figure 4: Close-up of the area of interest.

This result makes it possible to quickly obtain engineering
estimates of fat-tree characteristics without running the
algorithm.

6.2 Designing for future expansion 

Expanding existing fat-tree networks can be a difficult task.
While adding edge level switches is easy, core level switches
might not have the spare ports necessary for expansion if this
was not taken care of during the design phase. We suggest
designing the core level for the largest anticipated number of
ports, and then gradually connecting more nodes via additional
edge switches as the need arises.

We demonstrate with a real-life scenario that failure to
properly construct the core level can lead to non-expandable
networks. Suppose we need to build a computer cluster using
1U compute nodes and commodity switches with P = 36
ports. An additional design constraint is that the cluster shall
initially occupy two standard 42U racks with as many nodes
as possible, and be expandable to three racks in the future,
imitating a lack of space in the machine room.

If we do not take expandability into account, the network
design process proceeds as follows. Two racks contain 84U
of space. With N = 84 nodes we require E = 5 edge and
C = 3 core switches, which would occupy an additional 8U,
thus exceeding the allotted space. Hence we reduce the node
count to N = 76.

The resulting equipment placement is presented in Figure 3,
with a close-up of the area of interest in Figure 4.

Initially there are two racks, the left and the middle. Five
edge switches are located at the top of the middle rack,
followed by three core switches. All remaining space is
occupied by N = 42 + 34 = 76 compute nodes.

Let us discover opportunities for expandability when the 
third rack becomes available. First, we can use spare ports 
in already installed edge and core switches. There are 28 
spare ports in the topmost edge switch, shown in cyan, and 
32 spare ports in three core switches, shown in cyan and 
magenta. 

Using the 28 spare ports in the edge switch allows us to connect
14 more nodes, which would be placed at the bottom of the
newly available right rack, denoted by a cyan block. Of the 28
ports, half would be connected to the new nodes, and the
remaining half would be connected to the corresponding cyan
ports of the core switches.

The next opportunity for expansion lies in installing a sixth
edge switch at the top of the right rack, shown in magenta. 18
ports of this switch will be connected to 18 nodes in the
right rack, denoted by a magenta block. The remaining 18 ports
would be connected to the corresponding magenta ports of the
core switches.

Now the opportunities for expansion are exhausted. A total of
N = 108 nodes were connected, and nine units of rack space
cannot be utilized (the 32 new nodes and the sixth edge switch
occupy only 33U of the third rack), violating the design
constraint. Further expansion would require redesigning the
core layer: adding more switches and rewiring connections. For
large clusters this is a complex and costly task that should be
avoided if possible.
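
The port bookkeeping in this scenario can be checked with a few
lines of arithmetic. This is only a sketch under the assumptions
of the example (non-blocking network, 1U nodes and switches, one
downlink and one uplink per node on an edge switch, one core
port per node); the variable names are illustrative.

```python
PORTS = 36          # ports per switch (P)
HALF = PORTS // 2   # node-facing ports of a non-blocking edge switch
RACK_UNITS = 42     # height of a standard rack

nodes, edge, core = 76, 5, 3  # initial two-rack design from the text

# The last edge switch serves the nodes left over after the fully used ones.
last_edge_nodes = nodes - (edge - 1) * HALF
edge_spare = PORTS - 2 * last_edge_nodes   # one downlink + one uplink per node
core_spare = core * PORTS - nodes          # one core port per connected node

extra_on_spare_edge = edge_spare // 2      # "cyan" nodes on the existing edge switch
extra_on_new_edge = min(HALF, core_spare - extra_on_spare_edge)  # "magenta" nodes

total_nodes = nodes + extra_on_spare_edge + extra_on_new_edge
third_rack_used = extra_on_spare_edge + extra_on_new_edge + 1  # plus the sixth edge switch
unused_units = RACK_UNITS - third_rack_used

print(edge_spare, core_spare, total_nodes, unused_units)
```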

Instead, we can design a network from the start to accommodate
the largest anticipated number of nodes. In this case, three
racks can house N = 126 nodes. The network will consist of
E = 7 edge and C = 4 core switches, occupying in total 11U
of rack space. Hence the node count must be reduced accordingly
to N = 115. All three racks will be fully populated.

This allows the cluster to grow to 115 nodes, 7 more than
in the previous variant. However, this also incurs
increased network cost, as 11 switches are used instead of 9,
so design decisions have to be carefully balanced.

For this newly designed expandable network, there are two 
alternative variants of equipment installation in the initial 
two racks: 

1. Install all C = 4 core and E = 7 edge switches at once.
This requires 11U of rack space and leaves space for
73 nodes. An additional 42 nodes will be added when a
new rack is available.

2. Install all C = 4 core switches and as many edge
switches as required to fill up two racks with nodes,
namely, E = 5 edge switches. This requires 9U of
rack space and allows 75 nodes to be installed. An
additional 40 nodes and two edge switches will be added
when a new rack is available.

The latter variant allows the initial two racks to be populated
with more nodes and reduces the initial investment in edge
switches, as their procurement can be delayed until the
expansion stage.
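
The node counts quoted in this section can be reproduced with
a small search. This is a sketch, not the design algorithm
itself: it assumes 1U nodes, 1U switches with P = 36 ports and
a non-blocking network, so that each edge switch serves up to
P/2 nodes. The number of core switches is taken as an input,
because it is chosen by the design algorithm; the function name
and parameters are hypothetical.

```python
import math


def max_nodes(rack_units, core_switches, ports=36, edge_switches=None):
    """Largest node count that fits into rack_units together with its switches.

    If edge_switches is None, only as many edge switches as the nodes
    require are installed; otherwise that fixed number is installed.
    """
    half = ports // 2  # nodes served by one non-blocking edge switch
    best = 0
    for n in range(1, rack_units + 1):
        needed = math.ceil(n / half)
        e = needed if edge_switches is None else edge_switches
        if e >= needed and n + e + core_switches <= rack_units:
            best = n
    return best


print(max_nodes(84, 3))                    # two racks, no provision for expansion
print(max_nodes(126, 4))                   # three racks, expandable design
print(max_nodes(84, 4, edge_switches=7))   # variant 1: all switches installed at once
print(max_nodes(84, 4))                    # variant 2: edge switches added as needed
```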

6.3 Placing equipment and calculating cable costs

When designing networks using the proposed algorithm, the 
cost of switches is immediately known, as it follows from the 
number of switches. The other part - the cost of cables - is 
not instantly available, because it depends on cable length 
which is not known yet. 

We saw in section 5.3 that cables represent a small portion
of the cost of the entire computing system. For draft designs
it is beneficial to use a "typical" cable cost as an
approximation. For final designs, precise cable lengths (which
drive procurement decisions) can be obtained after the routing
phase, although this is a computationally expensive procedure.

Moreover, cable routing depends on such input parameters
as rack sizes, server sizes, machine room dimensions and
other factors. As a result, routing only becomes possible
after the positions of equipment in racks and the placement of
racks on the floor plan become known. Therefore we suggest
postponing cable routing until the final stages of the design
process.

A straightforward algorithm by Mudigonda et al. [10] can be
employed to perform cable routing. This algorithm relies on
a prior placement of nodes and switches into racks in a
manner that minimizes total cable length. Such positioning
resembles a knapsack problem, so a number of heuristics
were introduced in the cited paper.

As the network design algorithm proposed in our article uses 
switches of differing sizes, we also propose certain heuristics. 

We show below that fat-trees are perfectly suitable for pack- 
ing racks as densely as possible, or for leaving as much blank 
space in every rack as desired (e.g., for other equipment), 
and also allow for a smooth transition between these ex- 
treme cases. After running the network design algorithm, 
the number and types of required switches are obtained, 
hence the size they occupy becomes available. The same is 
true for compute nodes. Therefore the number of racks re- 
quired to house the equipment becomes known. Assuming 
that machine room dimensions are specified, racks can be 
placed according to mechanical and cooling requirements. 
Then the equipment placement stage can begin, using the 
following heuristics. 

1. Modular switches are physically indivisible and should
be placed first. Otherwise one may end up with partially
filled racks, none of which has enough space to house a
relatively big modular switch, and new empty racks would
be required.

2. Space for other indivisible equipment can be reserved 
in the same manner. 

3. As bundles of cables run from all edge switches to all 
core switches, the placement of core switches becomes 
important. They could be placed in the geometrical 
centre of a machine room, or put uniformly around 
the room, or otherwise. This requires further investi- 
gation. 

4. The principal "building block" of a fat-tree network is 
an edge switch and nodes connected to it. It is logical 
to keep this switch and its nodes in the same rack 
(with blade servers, this occurs by itself). They are 
connected with relatively short cables. One or more 
such building blocks are put in a rack, until one of the 
following conditions occurs: 

(a) The remaining free space in the rack cannot ac- 
commodate another building block; 

(b) Adding more blocks would exceed the weight bud- 
get of the rack stipulated by the floor load limit; 

(c) Adding more blocks would raise the power consumption
of the equipment in the rack above the capacity of the
power supply or cooling systems.

We then calculate the remaining budget of all three 
characteristics (space, weight and power) for the cur- 
rent rack, estimate the number of compute nodes that 
can be placed in this rack at a later time, and proceed 
to the next rack. 

5. The previous step - placing building blocks into racks - 
is repeated multiple times, until we have enough semi- 
filled racks to accommodate the next building block by 
"spreading" it among these racks. 

This strategy allows racks to be filled as densely as
possible, but results in irregular wiring patterns, as noted
by other researchers. However, minimizing the number
of racks not only saves floor space, it also decreases
the length of cables running between distant
racks. If there is no goal to save space, this heuristic
could be omitted.

Example. When using commodity 1U InfiniBand switches
with P = 36 ports and 1U compute nodes, a building
block consists (in the case of a non-blocking network)
of one switch and 18 nodes, for a total of 19U. Two
such blocks can be placed in a standard 42U rack, and
4U of blank space will remain. After filling five racks
in this manner, the resultant 20U of space are enough
to house the next 19U block.

6. Although an edge switch and its nodes should preferably
be kept in a single rack, they don't necessarily
have to be adjacent. In fact, it is recommended to
place edge switches as close to the top of the rack as
allowed by the cables that go down to the nodes. This
decreases the length of cable bundles that run between racks,
potentially enabling the use of shorter cables.

Figure 5: Front view of four racks filled using the
proposed heuristic (see description in text).

Figure 6: Top view of 14 racks filled using the proposed
heuristic (see description in text).

7. Racks in a row are filled until the end of a row in a 
machine room is encountered. The next rack to be 
filled is chosen across the aisle that separates rows. 

This behaviour ensures that parts of the next block 
to be spread among several racks will remain close to 
each other. 

Placement proceeds in a serpentine pattern, until all 
equipment is placed. These heuristics are general enough 
to enable placement of compute nodes and switches of 
differing physical sizes into racks of differing heights. 
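
The space arithmetic behind heuristics 4 and 5 can be sketched
as follows. The sketch considers rack space only and ignores
the weight and power budgets of conditions (b) and (c); the
function names are illustrative and not taken from the
algorithm.

```python
import math


def blocks_per_rack(rack_units, block_units):
    """Whole building blocks that fit into one rack (space constraint only)."""
    return rack_units // block_units


def racks_until_spread(rack_units, block_units):
    """Number of identically filled racks whose combined blank space
    is enough to 'spread' one more building block among them."""
    leftover = rack_units - blocks_per_rack(rack_units, block_units) * block_units
    if leftover == 0:
        return None  # racks are completely full; nothing to spread into
    return math.ceil(block_units / leftover)


# Example from heuristic 5: 19U blocks (one 1U switch + 18 1U nodes), 42U racks.
print(blocks_per_rack(42, 19), racks_until_spread(42, 19))
```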

After placement is complete, the routing algorithm is run, 
and cable lengths and costs are obtained. A weighted aver- 
age of cable prices can then be used as an average cable price 
for a cluster of that size (a bigger cluster will need more 
racks and thus will have a greater average cable length). 
Throughout this article, an average cable price of $80 was 
used, obtained using a similar procedure. 
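
A weighted average of this kind could be computed as in the
sketch below. The cable counts and prices in the usage line are
hypothetical placeholders chosen only to show the calculation;
they are not the data behind the $80 figure.

```python
def weighted_average_price(cables):
    """Average price per cable, weighted by how many cables of each price are used.

    cables: iterable of (count, unit_price) pairs, one pair per cable length.
    """
    cables = list(cables)
    total_count = sum(count for count, _ in cables)
    if total_count == 0:
        raise ValueError("at least one cable is required")
    return sum(count * price for count, price in cables) / total_count


# Hypothetical mix: 10 short cables at $50 and 30 longer cables at $70.
average = weighted_average_price([(10, 50.0), (30, 70.0)])
print(average)
```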

We present an example of using these heuristics in Figure 5,
deliberately demonstrating filling the racks as densely as
possible. In this example, we need to design a non-blocking
fat-tree network for N = 396 compute nodes, using switches
with P = 36 ports. This configuration requires E = 22 edge
and C = 18 core switches.

The large indivisible blocks of equipment are installed first
(red). Then we install blocks of compute nodes (pale green),
and the edge switches for these blocks are placed at the top of
the corresponding racks (bright green). Finally, after installing
seven such blocks into four racks, we have enough spare space to
place one more block of compute nodes (pale yellow); this time,
however, it has to be "spread" among all four racks.
The exact location of the corresponding edge switch for this
block (bright yellow) can be chosen arbitrarily. Two units
of space in the rightmost rack remain empty and will be used
later.



The top view of 14 racks, arranged in two rows, is presented
in Figure 6. The first four racks of the upper row correspond
to those in Figure 5. After the indivisible equipment is
placed, we place 19 out of 22 blocks of compute nodes and
their edge switches (green). (The 18 core switches are placed
arbitrarily in this example, in two blocks, denoted by violet
colour.)

By now, 11 racks have been occupied, and there is enough
spare space left to accommodate the remaining three blocks
of compute nodes, which will have to be "spread" in the
spare space. The first two of them, denoted by yellow and
cyan, are spread across four racks. The last block, denoted
by magenta, spreads across five racks and crosses the aisle
that separates the rows. This dense placement approach is not
very elegant; however, in this particular example it makes it
possible to use 11 racks instead of 12.

7. CONCLUSIONS AND FUTURE WORK 

We have presented an algorithm to automatically design two-layer
fat-tree networks with arbitrary blocking factors. We applied
the proposed algorithm to design several networks and analysed
their characteristics. We demonstrated that a lower bound
(and a rough approximation) for many technical and economic
characteristics of the whole network can be easily
obtained from per-port metrics. We also discussed issues of
expandability of fat-trees and proposed placement heuristics.

Directions for future research include further exploration of
reduced tree topologies [11, 6] for large supercomputers and
estimation of savings in capital and operating expenditure.